Challenge

Amsterdam renting prices

Remy Alashabi

1. Introduction

Amsterdam is the capital of the Netherlands, a city with a rich history dating back to its 17th-century Golden Age. It has a population of 821,752, and as the population grows, so does the number of factors that influence rent: area of residence, the number and size of bedrooms, bathroom facilities, internet access, and more. The number of housing marketplaces is also increasing rapidly, and these marketplaces do not all work the same way and may offer different features.
Companies such as Funda and Kamernet address part of this problem, but they only display houses that are currently available, and those houses do not always have the desired features. A user therefore cannot know the estimated price of a house that does not yet exist or is not yet listed. Manual price estimation, in turn, can take hours or even days. Using artificial intelligence to predict the rental price of a house in Amsterdam from its important features is therefore valuable for homeowners and real estate agencies: whatever the features, housing type, or location, the agency can provide an estimate easily. In this project we look at a dataset of Amsterdam rental prices and predict the rent of houses based on important factors such as area, bedrooms, toilet, internet, etc.

In the current situation, an individual could use multiple websites, each with a different selection of available housing, to find a potential future home. The number of these marketplaces is increasing rapidly, and they do not all work the same way or offer the same features. This makes it hard for someone looking for a house to calculate what they will be paying in rent, especially since homes that are unavailable now may become available later without being listed on any website in the meantime. Accurately predicting the cost of future accommodation, and therefore budgeting properly, is hard both for new tenants and for existing tenants who want to move into a bigger house.

These factors, combined with the relative lack of knowledge among international newcomers about the rules and rights of renting in the Netherlands, point to another gap in the market (I Amsterdam, 2020). People in this category can easily be preyed upon and exploited by landlords through, for example, the state of delivery, extra fees, security deposits, and other hidden costs that may or may not be legal.

1.1 Goal and Questions

The goal of this project is to forecast housing rent prices for potential local tenants as well as international newcomers, with respect to their budget, financial plan, and preferred features.

Can house price prediction help newly arriving students identify a reasonable base rent for negotiation and spot scams?
• What is the estimated rent for a tenant in Amsterdam, given a house with specific coordinates, floor area in square metres, and tenancy agreement (i.e., shared or individual)?
• What is the estimated margin of error of that prediction?

Further EDA and analysis are presented below.

2. Data requirements, collection and tidying

Before we start, let us study the datasets. They were crawled from the websites Huurwoningen and Kamernet. We will first clean the Kamernet data, then the Huurwoningen data, and finally merge them.

The Kamernet dataset has 62 columns and 46,722 rows. Most of these columns hold unnecessary data such as link, ID, or match gender. To stay within scope, I selected the columns I consider necessary:
areaSqm: area of the house in square metres
city: city
longitude: longitude of the house
latitude: latitude of the house
toilet: whether the toilet is owned or shared
shower: whether the shower is owned or shared
kitchen: whether the kitchen is owned or shared
living: whether the living space is owned or shared
propertyType: whether the property is an apartment or a room
rent: rent price in euros
postalCode: zip code of the house

The data from Huurwoningen has 9 columns and 1,998 rows:
url
title: contains the city name
postcode: zip code of the house
rent: rent price in euros
area: area of the house in square metres
type of property: whether the property is an apartment or a room
construction year: year the house was built
rooms: number of rooms in the house
bedrooms: number of bedrooms

Before we start with the EDA, we tidy the data and make sure it is clean, so that reliable and accurate information can be extracted.

2.1 kamernet

In this section we clean the Kamernet dataset.

Now we select the columns needed for our model into a new data frame. After that, we check for null values and check whether each row was successfully crawled.

As can be seen, there are a lot of missing values in this dataset. How do we approach this? Looking further into the data, we notice that some values are not filled in or are labelled “unknown”, and the crawlStatus column shows which rows were crawled and which were not. So we drop the rows the crawler could not reach and convert the placeholder values of the missing data to NaN.
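The two steps above can be sketched on a toy frame; the column names and placeholder strings here are assumptions, not the exact crawler output:

```python
import numpy as np
import pandas as pd

# Toy rows mimicking the crawl output; column names are assumptions
df = pd.DataFrame({
    "crawlStatus": ["done", "uncrawled", "done"],
    "toilet": ["Shared", "", "Unknown"],
})

# Drop rows the crawler never reached, then turn placeholder strings into NaN
df = df[df["crawlStatus"] != "uncrawled"]
df = df.replace({"": np.nan, "Unknown": np.nan})
```

After this, the "Unknown" entry shows up as a real NaN, so the usual missing-value counts pick it up.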

Looking at the data now, the number of NaN values has increased, because we replaced every empty or “unknown” string with NaN. How do we deal with this? Let us filter the city for Amsterdam only and check again. We filter on strings that contain the word "Amsterdam", because that covers the whole Amsterdam area: some places have different names but are still in Amsterdam, for example "Amsterdam Zuid".
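A minimal sketch of this substring filter, on invented city values:

```python
import pandas as pd

# Hypothetical sample of the city column
df = pd.DataFrame({
    "city": ["Amsterdam", "Amsterdam Zuid", "Utrecht", "amsterdam-Oost"],
    "rent": [900, 1200, 700, 1000],
})

# Case-insensitive substring match keeps districts such as "Amsterdam Zuid"
ams = df[df["city"].str.contains("amsterdam", case=False, na=False)]
```

`na=False` makes rows with a missing city drop out of the filter instead of raising an error.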

As can be observed, the NaN values have decreased, so it is time to impute. Most of these values are missing at random: they can be predicted from other variables, since the variables are not independent. I chose to impute with the mode, the most frequently repeated value of each column, because the data is missing simply because the crawler did not retrieve all information.

This is a different way of visualising the missing data in our dataset. Between all the columns with missing values there is a similar pattern: they share the same distribution of missing data. The sparkline on the far right shows the numbers 8 and 12, which tells us that each row has at minimum 8 and at most 12 of these columns filled.

The reason we impute with the mode is that, based on some research, the data was not completely crawled because the Kamernet website is protected by JSON files that hinder crawling. We observe houses with a shared toilet, shared kitchen, and shared shower where only living is missing as NA. Based on that, it was decided to impute with the mode. Moreover, since the data is categorical and missing at random, the mode is the best approach; the alternatives require the variable to be numerical. The limitation of this decision is that the results might be biased.
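Mode imputation can be sketched as below; the frame is a toy with assumed column names, not the real dataset:

```python
import pandas as pd

# Toy frame with the kind of gaps left after crawling (column names assumed)
df = pd.DataFrame({
    "toilet":  ["Shared", None, "Shared", "Own"],
    "kitchen": ["Shared", "Shared", None, "Shared"],
})

# Fill each categorical column with its mode (most frequent value)
for col in df.columns:
    df[col] = df[col].fillna(df[col].mode()[0])
```

`mode()` returns a Series (there can be ties), so `[0]` picks the first most frequent value.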

2.2 huurwoningen

Here we clean the dataset from Huurwoningen.

Let us first clean the data before changing the data types. We start with rent: the first character of each rent value is a euro sign, and the last 9 characters are useless and need to be removed, which is done below.

The title column has to be reduced to the city name, as it contains both the house title and the city name. Since we only need the city name, we strip all other characters and keep only the city.

As can be seen, we have succeeded in stripping all characters except the last 9, which occur as often as "amsterdam". Now we filter for the city of Amsterdam and check whether there are any other values we missed.

Now we move on to area, where we strip the "m2" suffix, and to location, where we keep only the zip code.

When I tried to change the column types, I got error messages about the format of the area column. After some investigation into how it was crawled, it turned out the data was not clean: it contains some hidden characters ("€ ") that are not displayed above by the pandas library. Inspecting the web page I crawled showed that these characters appear before the price and the area.
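A minimal sketch of cleaning such strings into integers; the example values are illustrative, not the actual scraped rows:

```python
import pandas as pd

# Illustrative raw strings; the real scraped values include a hidden "€ " prefix
raw = pd.Series(["€ 1.250 per maand", "€ 950 per maand"])

# Strip the euro sign and thousands separator, then keep only the digits
rent = (raw.str.replace("€", "", regex=False)
           .str.replace(".", "", regex=False)
           .str.extract(r"(\d+)")[0]
           .astype(int))
```

Only after the non-numeric characters are gone does `astype(int)` succeed without format errors.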

Below we start imputing the NAs. For houses with 1 room it is logical to have 1 bedroom too; checking the website confirms that it does not list bedrooms separately, as the bedroom is included in the room.

There seems to be a pattern here, as can be seen above: a house with n rooms has n−1 bedrooms. For example, a house with 5 rooms has 4 bedrooms. Following this logic, it seems reasonable to impute every missing value with n−1.
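The rooms−1 imputation can be sketched as below. The frame is a toy; clipping at 1 is my own assumption so that the single-room case from the previous paragraph still gets 1 bedroom:

```python
import pandas as pd

# Hypothetical rows; missing bedroom counts follow the rooms - 1 pattern,
# clipped at 1 so a single-room listing still counts as 1 bedroom
df = pd.DataFrame({"rooms": [1, 3, 5, 2],
                   "bedrooms": [None, 2, None, 1]})

df["bedrooms"] = df["bedrooms"].fillna((df["rooms"] - 1).clip(lower=1)).astype(int)
```

`fillna` with a Series aligns on the index, so only the missing rows are touched.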

Now we look at the property type and study the missing data.

Looking at the property types, the data is not missing because of the crawling; it is actually missing from the website itself. As proof, compare the row with ID 129 in the table above with the listing at https://www.huurwoningen.nl/huren/amsterdam/1282353/esplanade-de-meer/. This is considered missing-at-random data.

2.3 Enhancing data set

Here we try to enhance the Kamernet dataset with the datasets from Huurwoningen and Funda, enriching it with more data in the hope of better accuracy.

Notice that enhancing the dataset with the number of rooms and bedrooms reduced it from 8,000 to 99 rows. Let us look at a third dataset and check whether we can increase the number of rows.

I merged the original dataset with the Huurwoningen dataset to add new features, and then tried to merge the result (new) with a dataset from Funda (funda). These two datasets share no houses: the merge produces zero rows, because Funda is a house-selling company, not a rental one. We therefore do not use the third dataset (new2) and continue with the first merged dataset (new).
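The row loss can be illustrated with an inner join on stand-in frames; the join key and column values below are assumptions, not the report's actual merge keys:

```python
import pandas as pd

# Illustrative stand-ins for the two sources; join key is an assumption
kamer = pd.DataFrame({"postcode": ["1011AB", "1071CD", "1098EF"],
                      "rent": [900, 1500, 1100]})
huur = pd.DataFrame({"postcode": ["1071CD", "3511GH"],
                     "rooms": [3, 4]})

# An inner join keeps only listings present in both sources, which is
# why the merged set shrinks so sharply (8000 -> 99 rows in the report)
merged = pd.merge(kamer, huur, on="postcode", how="inner")
```

With no overlapping keys at all, as with the Funda data, the same join simply returns zero rows.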

3. Data understanding and EDA

Here we do EDA on both the merged data and the original data, since in the modelling phase we will compare the metrics of the enhanced dataset against the original.

3.1 Merged data set

We first look into the areaSqm column. What information can we get from it?

Using the describe function, we notice that the minimum areaSqm is 9, the maximum is 120, and the average is almost 58. Diving in further, the rent has a mean of 1459, a minimum of 540, and a maximum of 2750.

Plotting the histogram, we see that it is right-skewed, which means a model could make larger errors and overestimate the outcome variable. The concentration lies mostly between 0 and 20, around the median, which suggests that most people cannot afford big houses.

There is a positive correlation between areaSqm and rent. Plotting rent against areaSqm with property type as hue also shows apartments at the high end, studios in the mid range, and rooms at the low end, which is interesting because it supports my hypothesis that area and property type affect rent.

Here we plot an interactive map of the houses in Amsterdam based on longitude and latitude.

Box plots are useful for visualising distribution, median, range, and outliers, so we use them here and study the features one by one. For apartments, 50% of the data is concentrated between 1500 and 1800, with the remainder between the whisker minimum and maximum of 1000-2300, outside the interquartile range. For rooms, 50% of the data falls between 700 and 900, and the rest between 500 and 1100, outside the interquartile range. Interestingly, the apartment type appears roughly normally distributed, satisfying Q3−Q2 = Q2−Q1, with many outliers, and the same goes for rooms. The box plots also show that the property types overlap, which makes it harder for the model to differentiate between them.

In the bedrooms box plots, a house with 3 rooms is negatively skewed (Q3−Q2 < Q2−Q1) and a house with 1 room is positively skewed (Q3−Q2 > Q2−Q1).

In the shower box plot, the shared shower appears negatively skewed, unlike the owned shower, which is positively skewed.

In the kitchen box plot, "None" shows only a single line, indicating it is the least common category. The shared kitchen has a wide distribution that is negatively skewed (Q3−Q2 < Q2−Q1), while the own kitchen looks close to normal and slightly positively skewed.

Turning to the correlation of the numerical variables, the heatmap shows that areaSqm has a high correlation of 0.81 with rent. Contrary to expectation, the correlation between rent and the number of bedrooms is low. AreaSqm correlates well with the number of rooms but less with the number of bedrooms, and there is a correlation of 0.6 between bedrooms and longitude.

Rent has a high correlation with property type (0.67) and a low correlation with bedrooms (0.07). Rent also correlates weakly with toilet, shower, and kitchen, but strongly with rooms, living, and areaSqm.

Based on this correlation heatmap we keep only the features whose correlation with rent is higher than 0.30:

  1. toilet
  2. shower
  3. kitchen
  4. living space
  5. property type
  6. rooms
  7. area sqm
    as we want a moderate to high correlation, not a low one.
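This thresholding step can be sketched as below, on a tiny invented frame (the numbers are illustrative, not the report's data):

```python
import pandas as pd

# Tiny illustrative frame; the 0.30 threshold mirrors the text
df = pd.DataFrame({
    "rent":     [500, 900, 1200, 1500, 2000],
    "areaSqm":  [10, 25, 40, 55, 80],
    "bedrooms": [2, 1, 2, 1, 2],
})

# Correlation of every feature with rent, then keep those above the cutoff
corr_with_rent = df.corr()["rent"].drop("rent")
selected = corr_with_rent[corr_with_rent.abs() > 0.30].index.tolist()
```

Using `.abs()` also keeps strongly negatively correlated features, should any appear.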

3.2 Original data set(kamernet)

We first look into the areaSqm column. What information can we get from it?

Using the describe function, we notice that the minimum areaSqm is 6, the maximum is 280, and the average is almost 32. Let us dive in further to get more information.

Plotting the histogram, we see that it is right-skewed, which means a model could make larger errors and overestimate the outcome variable. The concentration lies mostly between 0 and 50, around the median, which again suggests that most people cannot afford big houses.

The distribution of rent shown above is right-skewed, which could lead to overestimation since it is not a normal bell curve, but the median is visible at around 700 euros.

There is a positive correlation between areaSqm and rent. Plotting rent against areaSqm with property type as hue again shows apartments at the high end, studios in the mid range, and rooms at the low end.

Using the box plots, we study the features one by one. The box plot of property type shows only a single line for student residences, meaning we have little information about them, while anti-squat houses appear to be the cheapest and least common, with a median under 500. For apartments, 50% of the data is concentrated between 1300 and 1800, with the remainder between the whisker minimum and maximum of 800-2700, outside the interquartile range. For rooms, 50% falls between 700 and 900 and the rest between 400 and 1100; for studios, 50% falls between 900 and 1100 and the rest between 500 and 2000, in each case outside the interquartile range. Interestingly, apartments again appear roughly normally distributed (Q3−Q2 = Q2−Q1) with many outliers, while the other types are negatively skewed (Q3−Q2 < Q2−Q1). The box plots also show that the property types overlap, which makes it harder for the model to differentiate between them.

Interestingly, all the other box plots share the same distribution for the owned and shared categories, with a lot of outliers.

Now we plot the prices on the map, based on latitude and longitude. Looking at the map, the further you get from the city centre, the cheaper the rent.

Turning to the correlation of the numerical variables, the heatmap shows that areaSqm has a high correlation of 0.83 with rent.

There is a high correlation of 0.83 between areaSqm and rent, 0.80 between property type and areaSqm, and 0.77 between property type and rent. The other facilities (toilet, shower, and kitchen) all share the same correlation of 0.38.

4. Data preparation and hot encoding

Here we prepare both datasets (the merged data and the original data) to be fed into the model by one-hot encoding them, so that they can be compared.

4.1 merged data

Since most of our features are categorical, we use a general approach that does not presume any ordering: one-hot encoding, where each instance either belongs to a category (1) or not (0). Regression algorithms do not accept objects or strings, only numerical values, so one-hot encoding the categorical values into binary columns is the best approach here.
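A minimal sketch with `pd.get_dummies`, on invented categorical values:

```python
import pandas as pd

# Toy categorical columns matching the dataset's naming
df = pd.DataFrame({"toilet": ["Shared", "Own", "Shared"],
                   "propertyType": ["Room", "Apartment", "Room"]})

# One 0/1 indicator column per category, with no ordering assumed
encoded = pd.get_dummies(df, columns=["toilet", "propertyType"])
```

Each original column becomes one indicator column per unique value, e.g. toilet_Own and toilet_Shared.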

4.2 Original data

As with the merged data, most of the features here are categorical, so we again one-hot encode them into binary (0/1) columns: regression algorithms only accept numerical values, and one-hot encoding does not presume any ordering among the categories.

We have now one-hot encoded the data, so every unique value has its own column with its own binary indicator, as can be seen above. In the next chapter we preprocess these datasets and prepare them for modelling, where we drop the categorical object columns and use the one-hot encoded ones.

5. Preprocessing

Here we split the data before modelling, to prevent overfitting and to obtain a realistic evaluation of the model.
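The split can be sketched with scikit-learn's `train_test_split`; the arrays and the 80/20 ratio below are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and target standing in for the encoded dataset
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% for testing; a fixed random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```

The model only ever sees the training portion; the held-out portion gives the realistic evaluation mentioned above.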

5.1 merged dataset

Here we preprocess the merged dataset.

5.2 original dataset

Here we preprocess the original dataset.

6. Modeling

Here we start the modelling, using various regression algorithms to find the most suitable one. We use regression because we are predicting "how much", a continuous quantity. Since regression has multiple algorithms, we use several and compare them. Furthermore, we model both the merged dataset and the original dataset for comparison, and finally we model both datasets with and without hyperparameter tuning.

6.1 No hyperparameter tuning

6.1.1. Merged dataset

Here we start with the merged dataset, without any hyperparameter tuning.

6.1.1.1. Linear regression

Linear regression is an analysis used to forecast a value based on the value of another column. It gives better results when there is a linear relationship between the variables.

As can be seen above, this model has a slope of 9.649, a positive slope indicating a linear relationship between x and y (a slope of 0 would mean no linear relationship), with an intercept (b0) of 397329.162. Based on this, the regression line is y = 9.649x + 397329.162.

The mean squared error is very high: 57211.446 for the training set and 120919.946 for the test set. MSE measures the distance between the residuals and the regression line, and here it shows the model doing much better on training than on testing, which is not what we want.

As can be seen above, the R-squared value is 77% for training and 57% for testing; this is the proportion of the variance in the data the model can explain, and it tells us the model cannot predict unseen data well. In addition, the RMSE, the standard deviation of the residuals, is 347.7, meaning that if we used this model to predict the rent price it would be off by about 347 euros.
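The fit-and-score step can be sketched on synthetic data; the slope and noise below are invented for illustration, not the report's coefficients:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic, roughly linear rent-vs-area data (illustrative only)
rng = np.random.default_rng(0)
area = rng.uniform(10, 120, 100).reshape(-1, 1)
rent = 9.6 * area.ravel() + 400 + rng.normal(0, 50, 100)

model = LinearRegression().fit(area, rent)
pred = model.predict(area)
r2 = r2_score(rent, pred)                     # share of variance explained
rmse = mean_squared_error(rent, pred) ** 0.5  # typical error in euros
```

In the report these metrics are computed separately on the training and test splits to spot the gap discussed above.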

In cross-validation, this model had 52% accuracy in the first iteration, increased over the iterations, and ended at 72% in the fifth, with a mean of 64%.

6.1.1.2 Decision tree regression

A decision tree is a form of supervised learning mostly used for classification problems, but since we have a continuous target variable it can also be used for regression, as long as the target lies within the range of values in the training set. That is why it is a good idea to split the data randomly and set the random_state parameter.

As can be seen, the R-squared for the training set is very high at 99%, while the test set reaches 92%, 7 percentage points below training. This is still very good, as it shows the model can predict unseen data well. Furthermore, the RMSE indicates our predictions would be off by about 144 euros, and the test MSE is lower than the training MSE, which suggests the model generalises well.

In cross-validation, this model had 91% accuracy in the first iteration, dropped to 29% in the third, and then increased again, with a mean of 74%. Now let us look at the distribution of the original values versus the predictions.

The distribution is almost a bell curve with a slight right skew. We cannot conclude from this alone that the model works well; the skew might imply some overestimation.

We can see a linear relationship between the variables, indicating that the range of the predictions matches the range of the original values.

6.1.1.3 Ridge regression

Ridge regression is an algorithm used for multiple regression data; it is suitable for data with a large number of predictors and a smaller number of observations.
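A minimal sketch, on synthetic multi-predictor data with invented coefficients:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data; alpha sets the strength of the L2 penalty that
# shrinks coefficients toward zero to stabilise the fit
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(0, 0.1, 50)

ridge = Ridge(alpha=1.0).fit(X, y)
```

The alpha parameter is the main knob the grid search in chapter 6.2 tunes.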

As can be seen above, the test R-squared is 18 percentage points below training, and the RMSE indicates the model's predictions are off by about 340 euros. This is not a very good model for our dataset; after hyperparameter tuning in the next chapter it might give better results.

6.1.1.4 Randomforest regression

Random forest regression is a supervised learning algorithm that tends to give higher cross-validation accuracy, handles missing data, and resists overfitting individual trees.

As can be seen above, with the random forest model the training R-squared is very high at 96% and the test R-squared is 86%. Furthermore, the RMSE indicates the model is off by about 194 euros. Let us first cross-validate this using k-fold.

As the iterations increase, the cross-validation accuracy decreases, with a mean of 79%; in each fold, the model is trained on k−1 folds and tested on the remaining one. Now let us see how the errors are distributed.
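The k-fold procedure can be sketched as below; the data is synthetic and the fold count of 5 is an assumption for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic rent-vs-area data (illustrative only)
rng = np.random.default_rng(3)
X = rng.uniform(10, 120, (150, 1))
y = 9.6 * X.ravel() + 400 + rng.normal(0, 50, 150)

rf = RandomForestRegressor(n_estimators=50, random_state=42)
# 5-fold CV: each fold is held out once while the other four train the model
scores = cross_val_score(rf, X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=42))
mean_r2 = scores.mean()
```

Each entry in `scores` corresponds to one of the iterations plotted above.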

It almost forms a bell curve, slightly skewed to the right, which might imply some overestimation of the price.

We can see a linear relationship between the variables, indicating that the range of the predictions matches the range of the original values.

6.1.2 original dataset

6.1.2.1. Linear regression

Linear regression is an analysis used to forecast a value based on the value of another column. It gives better results when there is a linear relationship between the variables.

As can be seen above, this model has a slope of 9.926, a positive slope indicating a linear relationship between x and y (a slope of 0 would mean no linear relationship), with an intercept (b0) of −25009.332. Based on this, the regression line is y = 9.926x − 25009.332.

The mean squared error is very high: 71900.957 for the training set and 68496.723 for the test set. MSE measures the distance between the residuals and the regression line; here the model does slightly better on testing than on training, which is what we want.

The R-squared value is 72% for training and 74% for testing, the proportion of the variance the model can explain; the model generalises reasonably but is not very accurate overall. In addition, the RMSE, the standard deviation of the residuals, is 261.7, meaning that predictions of the rent price from this model would be off by about 261.7 euros.

In cross-validation, this model had 70% accuracy in the first iteration and ended at 71% in the fifth, with a mean of 72%.

6.1.2.2 Decision tree regression

A decision tree is a form of supervised learning mostly used for classification problems, but since we have a continuous target variable it can also be used for regression, as long as the target lies within the range of values in the training set. That is why it is a good idea to split the data randomly and set the random_state parameter.

The training R-squared is very high at 99%, but the test set does poorly with an R-squared of 72%, 27 percentage points below training. This is somewhat bad, as it shows the model does not predict unseen data well. The RMSE indicates predictions would be off by about 272 euros, and the test MSE is higher than the training MSE, indicating the model can be improved.

In cross-validation, this model had 58% accuracy in the first iteration, dropped to 67% by the third, and then increased, with a mean of 65%. Now let us look at the distribution of the original values versus the predictions.

We can see it forms a nearly perfect bell curve, which indicates the model is somewhat accurate.

We can see a linear relationship between the variables, indicating that the range of the predictions matches the range of the original values.

6.1.2.3 Ridge regression

Ridge regression is an algorithm used for multiple regression data; it is suitable for data with a large number of predictors and a smaller number of observations.

As can be seen above, the test R-squared is 2 percentage points above training, and the RMSE indicates the model's predictions are off by about 261 euros. This is not a very good model for our dataset; after hyperparameter tuning in the next chapter it might give better results.

6.1.2.4 Randomforest regression

Random forest regression is a supervised learning algorithm that tends to give higher cross-validation accuracy, handles missing data, and resists overfitting individual trees.

As can be seen above, with the random forest model the training R-squared is very high at 96% and the test R-squared is 82%. Furthermore, the RMSE indicates the model is off by about 215 euros. Let us first cross-validate this using k-fold.

As the iterations increase, the cross-validation accuracy decreases, with a mean of 80%; in each fold, the model is trained on k−1 folds and tested on the remaining one. Now let us see how the errors are distributed.

It forms a bell curve with the median at zero.

We can see a linear relationship between the variables, indicating that the range of the predictions matches the range of the original values.

6.2 Hyperparameter tuning

6.2.1 merged dataset

6.2.1.1 linear regression

There were not many parameters to tune for linear regression, so this section is empty and we use the defaults.

6.2.1.2 Decision tree

A decision tree is a form of supervised learning mostly used for classification problems, but since we have a continuous target variable it can also be used for regression, as long as the target lies within the range of values in the training set. That is why it is a good idea to split the data randomly and set the random_state parameter.
Now we use grid search to find the best parameter settings for this model.

Running the code below can take 15-20 minutes to search for the parameters.

The output of the previous code is:
{'max_depth': 5,
'max_features': 'auto',
'max_leaf_nodes': None,
'min_samples_leaf': 1,
'min_weight_fraction_leaf': 0.1,
'splitter': 'best'}
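The grid search itself can be sketched as below; the data is synthetic and the grid is deliberately tiny (the report's full grid, shown above, is much larger and slower):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic rent-vs-area data (illustrative only)
rng = np.random.default_rng(4)
X = rng.uniform(10, 120, (100, 1))
y = 9.6 * X.ravel() + 400 + rng.normal(0, 50, 100)

# Exhaustively try each candidate value with 3-fold cross-validation
grid = GridSearchCV(DecisionTreeRegressor(random_state=42),
                    {"max_depth": [3, 5, 10]}, cv=3)
grid.fit(X, y)
best_depth = grid.best_params_["max_depth"]
```

`best_params_` holds the winning combination, which is then used to refit the final model.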

Now let us fit and predict our target variable using the tuned hyperparameters.

Comparing the tuned and untuned models, the untuned model scored better: the tuned model has a test R-squared of 75% and a test MSE of 69031, while the untuned model has a test R-squared of 92% and a test MSE of 20950, which is much closer to zero. Could that be due to overfitting?

6.2.1.3 Ridge Regression

Ridge regression is an algorithm used for multiple regression data; it is suitable for data with a large number of predictors and a smaller number of observations.

Comparing the tuned and untuned models, the tuned one is clearly more accurate: the test R-squared is 63% versus 59% untuned. In addition, this model is off by about 322 euros, less than the untuned model's 340.

6.2.1.4 Randomforest regression

Random forest regression is a supervised learning algorithm that tends to give higher cross-validation accuracy, handles missing data, and resists overfitting individual trees. Now we use grid search to find the best parameter settings for this model.

Running the code above takes around 15 minutes. The best parameters from the grid search were:
{'max_depth': 20,
'min_samples_leaf': 2,
'min_samples_split': 2,
'n_estimators': 100}

This model scored 86% on testing with an RMSE of 199, slightly worse than the untuned model, which had an RMSE of 194 and the same test score of 86%. Could this be because the parameter grid was not chosen well?

6.2.2 original dataset

6.2.2.1 linear regression

There were not many parameters to tune for linear regression, so this section is empty and we use the defaults.

6.2.2.2 Decision tree

A decision tree is a form of supervised learning mostly used for classification problems, but since we have a continuous target variable it can also be used for regression, as long as the target lies within the range of values in the training set. That is why it is a good idea to split the data randomly and set the random_state parameter.
Now we use grid search to find the best parameter settings for this model.

Running the code below can take 15-20 minutes to search for the parameters.

The output of the previous code is:
{'max_depth': 5,
'max_features': 'auto',
'max_leaf_nodes': None,
'min_samples_leaf': 1,
'min_weight_fraction_leaf': 0.1,
'splitter': 'best'}

Now let us fit and predict our target variable using the tuned hyperparameters.
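A sketch of that fit, using the parameters listed above on synthetic stand-in data, is shown below. Note one assumption: max_features='auto' was removed from recent scikit-learn releases, so the regression-equivalent value 1.0 is used here instead.

```python
# Sketch: decision tree fitted with the tuned parameters from the grid search.
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# max_features=1.0 replaces the deprecated 'auto' for regressors.
tree = DecisionTreeRegressor(max_depth=5, max_features=1.0,
                             max_leaf_nodes=None, min_samples_leaf=1,
                             min_weight_fraction_leaf=0.1, splitter="best",
                             random_state=42)
tree.fit(X_train, y_train)

pred = tree.predict(X_test)
print(round(mean_squared_error(y_test, pred), 1))  # test MSE
```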

Comparing the tuned and untuned models, the untuned model again scores better: the tuned model reaches a test R² of 67% with a test MSE of 87,167.366, while the untuned model reaches a test R² of 72% with a test MSE of 74,473.942, which is closer to zero. Could that be due to overfitting?

6.2.2.3 Ridge Regression

Ridge regression is an algorithm for multiple regression data; it is well suited to datasets with a large number of predictors relative to the number of observations.

Comparing the tuned and untuned models, the tuned model is slightly more accurate, although the values are almost identical: its R² on the test data is 74.34%, versus 74.31% for the untuned model. In addition, the tuned model overestimates by 261.5 euros, marginally less than the untuned model's 261.7 euros.

6.2.2.4 Random forest regression

As before, random forest regression tends to give higher cross-validation accuracy, handles missing data well, and limits the overfitting of individual trees. We will now use grid search to find the best parameter tuning for this model on the original dataset.

Running the code above takes about 30 minutes. The grid search found these best parameters:

{'bootstrap': True,
'max_depth': 20,
'max_features': 'sqrt',
'min_samples_leaf': 1,
'min_samples_split': 2,
'n_estimators': 600}

As can be seen above, the model did better on training than on testing: the training R² is 96%, versus 83% for testing. The RMSE of 212 means the model's predictions are off, on average, by roughly that amount in euros, on the high side. Finally, the test MSE is higher than the training MSE, which implies overfitting.
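The overfitting check used throughout this section, comparing the training and testing MSE of the same fitted model, can be sketched as follows (again on synthetic stand-in data, not the actual rent dataset):

```python
# Sketch: diagnose overfitting by comparing train MSE against test MSE.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=8, noise=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

mse_train = mean_squared_error(y_train, model.predict(X_train))
mse_test = mean_squared_error(y_test, model.predict(X_test))

# A test MSE far above the train MSE points at overfitting.
print(f"train MSE: {mse_train:.0f}, test MSE: {mse_test:.0f}")
```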

It can clearly be seen that the untuned model did better than the tuned model: the untuned test score was 82%, compared to 72% for the tuned model. The difference in RMSE between the two is 54 euros, in favour of the untuned model.

7. Trying new features to predict

Here we try to predict the price based on desired features.

7.1. Based on the non-enhanced dataset and default hyperparameters

7.1.1. Linear regression

Based on the table below, we follow the table's format while filling an array with the desired features.
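Filling such an array and predicting a price could be sketched as below. The feature names and column order (area, bedrooms, toilets, internet) follow the factors named in the introduction, but the values, coefficients, and model here are purely illustrative stand-ins.

```python
# Sketch: predict the rent for one house described by desired features.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Stand-in training table, columns: [area_m2, bedrooms, toilets, internet]
X = rng.uniform([20, 1, 1, 0], [120, 5, 3, 1], size=(200, 4))
# Illustrative price rule with some noise, so the fit is non-trivial.
y = 12 * X[:, 0] + 150 * X[:, 1] + 80 * X[:, 2] + 50 * X[:, 3] \
    + rng.normal(0, 50, 200)

model = LinearRegression().fit(X, y)

# One row in the same column order as the training table:
desired = np.array([[60, 2, 1, 1]])  # 60 m2, 2 bedrooms, 1 toilet, internet
print(round(model.predict(desired)[0]))
```

The key point is that the new row must use exactly the same columns, in the same order, as the table the model was fitted on.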

7.1.2. Decision tree

7.1.3. Ridge regression

7.1.4 Random forest regression

7.2 Based on the enhanced dataset and hyperparameter tuning

7.2.1 Linear regression

Based on the table below, we follow the table's format while filling an array with the desired features.

7.2.2 Decision tree

7.2.3 Ridge regression

7.2.4 Random forest regression

7.3 Based on the non-enhanced dataset and hyperparameter tuning

Based on the table below, we follow the table's format while filling an array with the desired features.

7.3.1 Linear regression

7.3.2 Decision tree

7.3.3 Ridge regression

7.3.4 Random forest regression

7.4 Based on the enhanced dataset without hyperparameter tuning

Based on the table below, we follow the table's format while filling an array with the desired features.

7.4.1 Linear regression

7.4.2 Decision tree

7.4.3 Ridge regression

7.4.4 Random forest regression

8. Models comparison

Here we compare the models and the datasets, determine which performed better, and elaborate on why.

8.1. Based on the enhanced dataset and default hyperparameters

Comparing the models trained on the merged dataset with default parameters, both the decision tree and the random forest score above 80% in test R² and almost 100% in training. This is interesting, because the random forest was expected to do better since it limits overfitting; the result implies that the decision tree model is overfit, given that its testing error is higher than its training error: the model fits the seen points well but does worse on new points.
Below we explore this further and compare.
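The kind of side-by-side comparison made in this chapter can be sketched as a loop over the four model types, recording training and testing R² for each. As elsewhere, the data here is synthetic stand-in data and the untuned default settings are an assumption.

```python
# Sketch: compare the four regressors on one train/test split.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "decision tree": DecisionTreeRegressor(random_state=42),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=42),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # (train R^2, test R^2): a large gap between the two suggests overfitting.
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(name, scores[name])
```

On a split like this, an unpruned decision tree reaches a perfect training R² while scoring lower on testing, the pattern discussed above.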

8.2. Based on the enhanced dataset and hyperparameter tuning

Interestingly, after tuning, the random forest model performed best, better than the decision tree, scoring above 80%. This is due to the nature of the random forest: by averaging over many trees it disregards noise in the data, which limits overfitting and therefore yields higher accuracy than the decision tree, which is easily prone to overfitting. Meanwhile, comparing ridge and linear regression, ridge performs better on testing while linear performs better on training. This is because there are too many features for plain linear regression to handle, so it overfits; ridge is therefore recommended, since ridge and lasso are regularisation algorithms that help avoid overfitting.

If we compare the untuned and tuned models, we notice that the untuned models appear to do better. That could be either because the new parameters are worse than the defaults, or because tuning made the model balanced rather than overfitted, which lowers its apparent score.

8.3 Based on the non-enhanced dataset and hyperparameter tuning

Comparing the models on the tuned original dataset, the random forest does best in training, followed by the linear, ridge, and decision tree models. In testing, the random forest has an R² of 83%, compared to almost 80% for ridge.

When tuned, the random forest again performed best, scoring above 80% and beating the decision tree, for the same reasons as before: averaging over many trees disregards noise and limits overfitting, whereas the decision tree is easily prone to it. Likewise, ridge regression again outperforms linear regression on testing while linear does better on training, confirming that with this many features regularisation is the better choice.

8.4 Based on the non-enhanced dataset and default hyperparameters

What is interesting here is that, without tuning, the models seem to have high accuracy. Comparing the decision tree and the random forest, the decision tree has almost 100% accuracy on training but only around 70% on testing, because the algorithm is overfitted. The random forest, on the other hand, has around 90% accuracy on training and around 80% on testing, thanks to its nature of limiting overfitting.

9. Conclusion

The difference between the datasets shows that the more data is included, the higher the accuracy. The untuned original dataset's highest training score was 99% with a testing score of 82%, measured in R², while the untuned enhanced dataset's highest training score was 99% with a testing score of 92%. Comparing the tuned datasets, the original dataset reached a highest training score of 96% and a testing score of 83%, versus a highest training score of 93% and a testing score of 86% for the enhanced dataset. Both comparisons show that adding more data improves accuracy.

As for why tuning the parameters matters: the reason accuracy was so high before tuning is that the algorithms were overfitting. This can be seen in their mean squared errors. Before tuning, the testing MSE was higher than the training MSE, which implies overfitting; after tuning, the training MSE moved into the range of the testing MSE. An example is the decision tree on the original dataset: before tuning, the training MSE was 1,785 and the testing MSE was 74,473, which clearly shows overfitting, while after tuning the training MSE was 89,764 and the testing MSE was 87,167. Therefore, tuning the hyperparameters and enhancing the data are both important factors in achieving reliable, high-accuracy results.

According to the interviews conducted with the target audience and the domain expert, this project can reduce the time taken to search for reliable housing; it can predict what a house is worth, with a price margin (i.e., the RMSE) since the model tends to overestimate; it gives newly arriving students a baseline for price negotiation; and the margin between the predicted and actual price can be used to flag houses as potential scams.

Below is the table that was fed into the models; a predicted-price column will be added to it. The table will be saved as a CSV and used in a Power BI application. Please see the phase 4 document.